Abstract: In this paper we establish lower bounds on information
divergence from a distribution to certain important classes of
distributions such as the Gaussian, exponential, Gamma, Poisson,
geometric, and binomial distributions. These lower bounds are tight,
and for several convergence theorems where a rate of convergence can
be computed, this rate is determined by the lower bounds proved in
this paper. General techniques for getting lower bounds in terms of
moments are developed.



I. Introduction and notations 

In 2004, O. Johnson and A. Barron proved [1] that the rate of
convergence in the information theoretic Central Limit Theorem is
upper bounded by $c/n$ under suitable conditions. P. Harremoës
extended this work in [2] based on a maximum entropy approach.
Similar results have been obtained for the convergence of binomial
distributions to Poisson distributions. Finally, the rate of
convergence of convolutions of distributions on the unit circle
toward the uniform distribution can be bounded. In each of these
cases, lower bounds on information divergence in terms of moments of
orthogonal polynomials or trigonometric functions give lower bounds
on the rate of convergence. In this paper, we provide more lower
bounds on information divergence, using mainly orthogonal polynomials
and the related exponential families.

We will identify $x!$ with $\Gamma(x+1)$ even when $x$ is not an
integer. Similarly, the generalized binomial coefficient
$\binom{x}{n}$ equals $x(x-1)\cdots(x-n+1)/n!$ when $x$ is not an
integer. We use $\tau$ as short for $2\pi$.

II. Moment calculations 

Let $\{Q_\beta;\ \beta\in\Gamma\}$ denote an exponential family of
distributions such that the Radon-Nikodym derivative is

\[ \frac{dQ_\beta}{dQ_0} = \frac{\exp(\beta\cdot x)}{Z(\beta)} \]

and where $\Gamma$ is the set of $\beta$ such that the partition
function $Z$ is finite, i.e.

\[ Z(\beta) = \int \exp(\beta\cdot x)\, dQ_0(x) < \infty. \]

The partition function $Z$ is also called the moment generating
function. The parametrization $\beta\to Q_\beta$ is called the natural
parametrization. The mean value of the distribution $Q_\beta$ will be
denoted $\mu_\beta$. The distribution with mean value $\mu$ is denoted
$Q^\mu$ so that $Q^{\mu_\beta} = Q_\beta$. The inverse of the function
$\beta\to\mu_\beta$ is denoted $\hat\beta(\cdot)$ and equals the
maximum likelihood estimate of the canonical parameter. The variance
of $x$ with respect to $Q^\mu$ is denoted $V(\mu)$ so that
$\mu\to V(\mu)$ is the variance function of the exponential family.
This variance function uniquely characterizes the exponential family.

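We note that $\beta\to\ln Z(\beta)$ is the cumulant generating
function so that

\[ \frac{d}{d\beta}\ln Z(\beta)\Big|_{\beta=0} = E[X], \qquad
   \frac{d^2}{d\beta^2}\ln Z(\beta)\Big|_{\beta=0} = \operatorname{Var}(X), \qquad
   \frac{d^3}{d\beta^3}\ln Z(\beta)\Big|_{\beta=0} = E\left[(X-E[X])^3\right]. \]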
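As a quick numerical illustration of these identities, the sketch below differentiates $\ln Z$ by central finite differences for a Poisson reference distribution. The choice of family, the closed form $\ln Z(\beta)=\lambda(e^{\beta}-1)$, the value $\lambda=3$ and the step size `h` are assumptions made for this example and are not taken from the text.

```python
import math

def log_partition(beta, lam=3.0):
    # Assumed reference measure: Poisson(lam), for which
    # ln Z(beta) = ln E[exp(beta * X)] = lam * (exp(beta) - 1).
    return lam * (math.exp(beta) - 1.0)

def derivatives_at_zero(f, h=1e-2):
    # Central finite differences for the first three derivatives at 0.
    d1 = (f(h) - f(-h)) / (2 * h)
    d2 = (f(h) - 2 * f(0.0) + f(-h)) / h ** 2
    d3 = (f(2 * h) - 2 * f(h) + 2 * f(-h) - f(-2 * h)) / (2 * h ** 3)
    return d1, d2, d3

mean, var, third_central = derivatives_at_zero(log_partition)
print(mean, var, third_central)  # all three are close to lam for a Poisson law
```

For the Poisson distribution the mean, the variance and the third central moment all equal $\lambda$, so the three printed numbers should agree up to the discretization error.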



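Lemma 1. Let $\{Q_\beta;\ \beta\in\Gamma\}$ denote an exponential
family with

\[ \frac{dQ_\beta}{dQ_0} = \frac{\exp(\beta\cdot x)}{Z(\beta)}. \]

Then

1) for all $\mu$ and $\nu$,

\[ D(Q^\mu\|Q^\nu) = \frac{(\mu-\nu)^2}{2V(\eta)} \]

for some $\eta$ between $\mu$ and $\nu$;

2) for all $\alpha$ and $\beta\in\Gamma$,

\[ D(Q_\alpha\|Q_\beta) = \frac{V(\mu_\gamma)}{2}(\alpha-\beta)^2 \]

for some $\gamma$ between $\alpha$ and $\beta$.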
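Proof: The two parts of the lemma are proved separately.

1) We consider the function

\[ g(t) = D(Q^t\|Q^\nu)
        = \left(\hat\beta(t)-\hat\beta(\nu)\right)t
          + \ln Z\left(\hat\beta(\nu)\right) - \ln Z\left(\hat\beta(t)\right). \]

The first two derivatives of this function are

\[ g'(t) = \frac{d\hat\beta(t)}{dt}\,t + \hat\beta(t)-\hat\beta(\nu)
           - \frac{Z'\left(\hat\beta(t)\right)}{Z\left(\hat\beta(t)\right)}\frac{d\hat\beta(t)}{dt}
         = \hat\beta(t)-\hat\beta(\nu), \]

\[ g''(t) = \frac{d\hat\beta(t)}{dt} = \frac{1}{dt/d\hat\beta(t)} = \frac{1}{V(t)}. \]

According to Taylor's formula there exists $\eta$ between $\mu$ and
$\nu$ such that

\[ D(Q^\mu\|Q^\nu) = g(\mu)
   = g(\nu) + (\mu-\nu)\,g'(\nu) + \tfrac12(\mu-\nu)^2 g''(\eta)
   = \frac{(\mu-\nu)^2}{2V(\eta)}. \]

2) The second part is proved in the same way as the first part.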
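The first part of the lemma can be illustrated numerically. The following sketch assumes the Poisson family, for which $V(\eta)=\eta$ and the divergence has the standard closed form used in `poisson_divergence`; the function names and the test pairs are hypothetical choices for this example. It solves the identity of the lemma for $\eta$ and checks that the result lies between the two means.

```python
import math

def poisson_divergence(mu, nu):
    # D(Po(mu) || Po(nu)) in nats (standard closed form).
    return mu * math.log(mu / nu) - mu + nu

def implied_eta(mu, nu):
    # For the Poisson family V(eta) = eta, so part 1 of the lemma reads
    # D = (mu - nu)^2 / (2 * eta); solve this identity for eta.
    return (mu - nu) ** 2 / (2.0 * poisson_divergence(mu, nu))

for mu, nu in [(2.0, 5.0), (5.0, 2.0), (0.5, 0.6)]:
    eta = implied_eta(mu, nu)
    print(mu, nu, eta, min(mu, nu) <= eta <= max(mu, nu))
```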



Corollary 2. Let $\{Q_\beta;\ \beta\in\Gamma\}$ denote an exponential
family with

\[ \frac{dQ_\beta}{dQ_0} = \frac{\exp(\beta\cdot x)}{Z(\beta)}. \]

If the variance function of the exponential family is increasing then

\[ D(Q^\mu\|Q^\nu) \ge \frac{(\mu-\nu)^2}{2V(\nu)} \]

for $\mu\le\nu$.



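The binomial distributions, Poisson distributions, geometric
distributions, negative binomial distributions, inverse binomial
distributions, and generalized Poisson distributions are exponential
families with at most cubic variance functions [3]. Using the former
corollary we can provide a lower bound on information divergence in
terms of the mean: if the mean $\mu$ of a random variable $X$ is
attained within the family, then $D(X\|Q^\nu)\ge D(Q^\mu\|Q^\nu)$,
because $Q^\mu$ minimizes the divergence from $Q^\nu$ among all
distributions with mean $\mu$.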

Example 3. The variance function of the Gaussian family is
$V(\mu)=1$. Hence, with $\Phi$ a standard normal random variable with
probability density $\frac{1}{\tau^{1/2}}\exp\left(-\frac{x^2}{2}\right)$,

\[ D(X\|\Phi) \ge \tfrac12 E[X]^2 \]

if $E[X]\le 0$.

This inequality actually holds if $X$ is Gaussian with variance 1;
using the exponential family based on the Gaussian distribution with
$x^2$ as sufficient statistics we get the inequality

\[ D(X\|\Phi) \ge \frac{(\operatorname{Var}(X)-1)^2}{4} \]

if $\operatorname{Var}(X)\le 1$.
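The variance bound can be checked by hand in the simplest case. The following worked computation assumes that $X$ is a centered Gaussian with variance $\sigma^2\le 1$ (an assumption made only for this illustration) and uses the standard closed form of the divergence between two centered Gaussians.

```latex
\begin{align*}
D(X\|\Phi) &= \tfrac12\left(\sigma^2 - 1 - \ln\sigma^2\right),
  \qquad \sigma^2 = 1-\delta,\quad 0\le\delta<1,\\
-\ln(1-\delta) &= \delta + \tfrac{\delta^2}{2} + \tfrac{\delta^3}{3} + \cdots
  \ \ge\ \delta + \tfrac{\delta^2}{2},\\
D(X\|\Phi) &\ge \tfrac12\left(-\delta + \delta + \tfrac{\delta^2}{2}\right)
  = \frac{\delta^2}{4} = \frac{(\operatorname{Var}(X)-1)^2}{4}.
\end{align*}
```

The expansion also indicates why the condition on the variance matters: for $\sigma^2>1$ the parameter $\delta$ is negative, the cubic term changes sign, and for instance $\sigma^2=2$ gives $D(X\|\Phi)=\tfrac12(1-\ln 2)\approx 0.153<\tfrac14$.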

The next example is about the exponential distribution. 

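Example 4. The Gamma distribution with shape parameter $\alpha$ and
scale parameter $\theta$ reads

\[ \Gamma_{\alpha+1,\theta}(x) = \frac{x^\alpha}{\alpha!\,\theta^{\alpha+1}}
   \exp\left(-\frac{x}{\theta}\right), \quad x > 0. \tag{1} \]

The variance function of the Gamma distribution
$V(m)=\frac{m^2}{\alpha+1}$ is increasing. Hence

\[ D(X\|\Gamma_{\alpha+1,\theta}) \ge \frac{(E[X]-m)^2}{2\frac{m^2}{\alpha+1}} \]

if $E[X]\le m$, where $m=(\alpha+1)\theta$ denotes the mean of
$\Gamma_{\alpha+1,\theta}$. Note that for $\alpha=0$ we get the
exponential distribution as a special case.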

The next example is about the binomial distribution. 

Example 5. The binomial distribution has point probabilities

\[ \mathrm{bin}(n,p,j) = \binom{n}{j}p^j(1-p)^{n-j}, \quad j=0,1,2,\dots,n. \]

The variance function is $V(m)=m-m^2/n$. The variance function has
its maximum for $m=n/2$. Hence

\[ D(X\|\mathrm{bin}(n,p)) \ge \frac{(E[X]-np)^2}{2np(1-p)} \]

if $E[X]\le np\le n/2$ or if $E[X]\ge np\ge n/2$. For $p=1/2$ the
inequality

\[ D(X\|\mathrm{bin}(n,1/2)) \ge \frac{2\left(E[X]-n/2\right)^2}{n} \]

holds for all random variables.
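A small numerical check of the binomial example is sketched below. It assumes that $X$ is itself binomial with the same $n$ and a different success probability `q`, which is only one convenient special case of the random variables covered by the example; the grid of values is arbitrary.

```python
import math

def binomial_divergence(n, q, p):
    # D(bin(n, q) || bin(n, p)) in nats; n times the Bernoulli divergence.
    return n * (q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p)))

def mean_bound(n, q, p):
    # (E[X] - n p)^2 / (2 n p (1 - p)) with E[X] = n q.
    return (n * q - n * p) ** 2 / (2.0 * n * p * (1 - p))

n, p = 10, 0.4
# Case E[X] <= n p <= n / 2.
checks = [binomial_divergence(n, q, p) >= mean_bound(n, q, p)
          for q in (0.05, 0.1, 0.2, 0.3, 0.39)]
# For p = 1/2 the bound 2 (E[X] - n/2)^2 / n is stated for every mean.
symmetric = [binomial_divergence(n, q, 0.5) >= 2.0 * (n * q - n / 2) ** 2 / n
             for q in (0.1, 0.3, 0.7, 0.9)]
print(checks, symmetric)
```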

The next example is about the Poisson distribution. 
Example 6. The Poisson distributions with point probabilities

\[ \frac{\lambda^j}{j!}\exp(-\lambda), \quad j=0,1,2,\dots \]

have variance function $V(\lambda)=\lambda$, which is increasing. Hence

\[ D(X\|Po(\lambda)) \ge \frac{(E[X]-\lambda)^2}{2\lambda} \]

for $E[X]\le\lambda$.
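The restriction on the mean in this example is essential. The sketch below, which assumes for illustration that $X$ is itself Poisson with mean `mu`, evaluates both sides of the inequality on a small arbitrary grid: the bound holds for the means below $\lambda$ and fails for the chosen mean above $\lambda$.

```python
import math

def poisson_divergence(mu, lam):
    # D(Po(mu) || Po(lam)) in nats.
    return mu * math.log(mu / lam) - mu + lam

def lower_bound(mu, lam):
    # Right-hand side of the inequality in the Poisson example.
    return (mu - lam) ** 2 / (2.0 * lam)

lam = 5.0
below = [(mu, poisson_divergence(mu, lam) >= lower_bound(mu, lam))
         for mu in (0.5, 1.0, 2.0, 4.0, 4.9)]
above = (8.0, poisson_divergence(8.0, lam) >= lower_bound(8.0, lam))
print(below)  # the bound holds for every mean below lam in this grid
print(above)  # and fails for this mean above lam
```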



Example 7. The negative binomial distribution $NB(r,p)$ with success
probability $p$ and number of failures $r$ has point probabilities

\[ \binom{k+r-1}{k}(1-p)^r p^k, \quad k=0,1,2,\dots \]

Its variance function $V(m)=\frac{m(m+r)}{r}$ is increasing. Hence

\[ D(X\|NB(r,p)) \ge \frac{(E[X]-m)^2}{2\frac{m}{r}(m+r)} \]

for $E[X]\le m$, where $m$ denotes the mean of $NB(r,p)$. For $r=1$
we get the geometric distribution as a special case.

The next examples involve cubic variance functions. 
Example 8. The inverse Gaussian distribution $IG(\mu,\lambda)$ has
density

\[ \left[\frac{\lambda}{\tau x^3}\right]^{1/2}
   \exp\left(\frac{-\lambda(x-\mu)^2}{2\mu^2 x}\right). \]

The variance function $V(\mu)=\mu^3/\lambda$ is increasing so

\[ D(X\|IG(\mu,\lambda)) \ge \frac{\lambda\left(E[X]-\mu\right)^2}{2\mu^3} \]

if $E[X]\le\mu$.

Similar results hold for the generalized Poisson distributions and
for the inverse binomial distributions [4]-[7].

III. General results for Gamma distributions 

To simplify the exposition in this section we will assume that the
scale parameters $\theta$ of the Gamma distributions equal 1.

A. A conjecture for the Gamma case 
The Gamma distribution reads

\[ \Gamma_{\alpha+1,1}(x) = \frac{x^\alpha}{\alpha!}\exp(-x), \quad x > 0. \]

The Laguerre polynomials are given by the Rodrigues formula

\[ L_n^\alpha(x) = \frac{x^{-\alpha}e^{x}}{n!}\frac{d^n}{dx^n}\left(e^{-x}x^{n+\alpha}\right). \]

The Laguerre polynomials are orthogonal with respect to the Gamma
distribution, but they are not normalized and they do not all have
positive leading coefficient. We thus introduce the normalized
Laguerre polynomials by

\[ \tilde L_n^\alpha(x) = \frac{(-1)^n L_n^\alpha(x)}{\binom{n+\alpha}{n}^{1/2}}. \]
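The normalization can be checked with exact moment formulas rather than numerical integration. The sketch below relies on two standard facts that are assumptions relative to the text: the explicit expansion $L_n^\alpha(x)=\sum_i(-1)^i\binom{n+\alpha}{n-i}x^i/i!$ and the moments $E[X^k]=\Gamma(\alpha+1+k)/\Gamma(\alpha+1)$ of the Gamma distribution with scale 1. The helper names are hypothetical.

```python
import math

def laguerre_coeffs(n, alpha):
    # Coefficients c[i] of x**i in the generalized Laguerre polynomial
    # L_n^alpha(x) = sum_i (-1)^i * binom(n + alpha, n - i) * x^i / i!.
    return [(-1) ** i * math.gamma(n + alpha + 1)
            / (math.factorial(n - i) * math.gamma(alpha + i + 1) * math.factorial(i))
            for i in range(n + 1)]

def normalized_coeffs(n, alpha):
    # Multiply by (-1)^n and divide by binom(n + alpha, n)^(1/2).
    norm = math.gamma(n + alpha + 1) / (math.factorial(n) * math.gamma(alpha + 1))
    return [(-1) ** n * c / math.sqrt(norm) for c in laguerre_coeffs(n, alpha)]

def gamma_moment(k, alpha):
    # E[X^k] when X has density x^alpha * exp(-x) / Gamma(alpha + 1).
    return math.gamma(alpha + 1 + k) / math.gamma(alpha + 1)

def inner_product(m, n, alpha):
    # Expectation of the product of two normalized Laguerre polynomials.
    a, b = normalized_coeffs(m, alpha), normalized_coeffs(n, alpha)
    return sum(ai * bj * gamma_moment(i + j, alpha)
               for i, ai in enumerate(a) for j, bj in enumerate(b))

alpha = 0.5
print(inner_product(2, 2, alpha), inner_product(1, 2, alpha))
print(normalized_coeffs(2, alpha)[-1] > 0)  # leading coefficient is positive
```

Up to rounding, the first printed value should be 1 and the second 0, which is what orthonormality with respect to $\Gamma_{\alpha+1,1}$ requires.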



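In Example 4 we saw that the inequality
$D(X\|\Gamma_{\alpha+1,1})\ge\tfrac12\left(E\left[\tilde L_1^\alpha(X)\right]\right)^2$
holds for any random variable $X$ satisfying
$E\left[\tilde L_1^\alpha(X)\right]\le 0$. We conjecture that a similar
result holds for the normalized Laguerre polynomials of order 2.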

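Conjecture 9. For any random variable $X$ and for any $k\in\mathbb{N}$
we have

\[ D(X\|\Gamma_{\alpha+1,1}) \ge \tfrac12\left(E\left[\tilde L_2^\alpha(X)\right]\right)^2 \tag{2} \]

if $E\left[\tilde L_2^\alpha(X)\right]\le 0$.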



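Lemma 10. Let $\{Q_\beta;\ \beta\in\Gamma\}$ denote an exponential
family with $x$ as sufficient statistics so that

\[ \frac{dQ_\beta}{dQ_0} = \frac{\exp(\beta\cdot x)}{Z(\beta)}. \]

If $\mu_0=0$ and $V(0)=1$ and $E_0\left[X^3\right]>0$ then there exists
$\varepsilon>0$ such that

\[ D(Q^\mu\|Q_0) \ge \frac{\mu^2}{2} \]

holds for $\mu\in[-\varepsilon,0]$.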

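Proof: From Lemma 1 we know that there exists $\eta$ between $\mu$
and $\mu_0$ such that

\[ D(Q^\mu\|Q_0) = D(Q^\mu\|Q^{\mu_0}) = \frac{(\mu-\mu_0)^2}{2}\cdot\frac{1}{V(\eta)}. \]

Therefore it is sufficient to prove that there exists $\varepsilon>0$
such that $V(\eta)\le 1$ for $\eta\in[-\varepsilon,0]$. This follows
from the fact that

\[ \frac{dV(\eta)}{d\eta} = \frac{E\left[(X-\eta)^3\right]}{\operatorname{Var}(X)} \]

where the mean and variance are taken with respect to the element in
the exponential family with mean $\eta$. Since
$\frac{E\left[X^3\right]}{\operatorname{Var}(X)}>0$ for $\beta=0$ we
have that $\frac{E\left[(X-\eta)^3\right]}{\operatorname{Var}(X)}>0$
for $\beta$ in a neighborhood of 0, so $V(\eta)$ is an increasing
function of $\eta$ and the result then follows from $V(0)=1$. ■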
We can now formulate the following result. 

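Proposition 11. For all $n\in\mathbb{N}_0$ and all $\alpha>-1$ we have

\[ \int_0^\infty\left(\tilde L_n^\alpha(x)\right)^3\Gamma_{\alpha+1,1}(x)\,dx > 0. \]

Proof: We have

\[ \int_0^\infty\left(\tilde L_n^\alpha(x)\right)^3\Gamma_{\alpha+1,1}(x)\,dx
   = \int_0^\infty\frac{(-1)^{3n}\left(L_n^\alpha(x)\right)^3}{\binom{n+\alpha}{n}^{3/2}}\Gamma_{\alpha+1,1}(x)\,dx
   = (-1)^n\binom{n+\alpha}{n}^{-3/2}\int_0^\infty\left(L_n^\alpha(x)\right)^3\Gamma_{\alpha+1,1}(x)\,dx, \]

which, according to [8, p. 57], is strictly positive.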
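For small degrees the positivity of this integral can be confirmed with exact moments. The sketch below is an illustration under assumptions, not part of the proof: it uses the explicit coefficient formula for the generalized Laguerre polynomials and the moments $\Gamma(\alpha+1+k)/\Gamma(\alpha+1)$ of $\Gamma_{\alpha+1,1}$, and it only covers $n\le 2$ and three arbitrary values of $\alpha$. The function names are hypothetical.

```python
import math

def normalized_laguerre(n, alpha):
    # Coefficients (lowest degree first) of
    # (-1)^n * L_n^alpha(x) / binom(n + alpha, n)^(1/2).
    norm = math.gamma(n + alpha + 1) / (math.factorial(n) * math.gamma(alpha + 1))
    return [(-1) ** (n + i) * math.gamma(n + alpha + 1)
            / (math.factorial(n - i) * math.gamma(alpha + i + 1)
               * math.factorial(i) * math.sqrt(norm))
            for i in range(n + 1)]

def poly_mul(a, b):
    # Product of two polynomials given by coefficient lists.
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def third_moment(n, alpha):
    # Integral of the cubed normalized polynomial against
    # the density x^alpha * exp(-x) / Gamma(alpha + 1).
    p = normalized_laguerre(n, alpha)
    cube = poly_mul(poly_mul(p, p), p)
    return sum(c * math.gamma(alpha + 1 + k) / math.gamma(alpha + 1)
               for k, c in enumerate(cube))

for alpha in (-0.5, 0.0, 2.0):
    print(alpha, [round(third_moment(n, alpha), 6) for n in range(3)])
```

For $\alpha=0$ and $n=1$ the value is the third central moment of the exponential distribution, which equals 2.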



Theorem 12. For any $n\in\mathbb{N}_0$ and any $\alpha>-1$ there
exists $\varepsilon>0$ that may depend on $\alpha$ and $n$ such that

\[ D(X\|\Gamma_{\alpha+1,1}) \ge \tfrac12\left(E\left[\tilde L_n^\alpha(X)\right]\right)^2 \]

for any random variable $X$ satisfying
$E\left[\tilde L_n^\alpha(X)\right]\in[-\varepsilon,0]$.

In the Gaussian case, we have the following similar result.

Corollary 13. For any $n\in\mathbb{N}_0$ there exists $\varepsilon>0$
such that

\[ D(X\|\Phi) \ge \tfrac12\left(E\left[H_{2n}(X)\right]\right)^2 \]

for any random variable $X$ satisfying
$E\left[H_{2n}(X)\right]\in[-\varepsilon,0]$.

This inequality has previously been proved by considering the Hermite
polynomials as limits of Poisson-Charlier polynomials for which a
similar inequality holds [9].

IV. Laguerre polynomials of degree 2 
We shall use the following lemma. 
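Lemma 14. Assume that

\[ \int_0^\infty\left(\tilde L_2^\alpha(x)\right)^2
   \exp\left(\beta_0\tilde L_2^\alpha(x)\right)\frac{x^\alpha}{\alpha!}\exp(-x)\,dx \le 1. \tag{3} \]

Then the conjecture holds for all random variables $X$ with
$E\left[\tilde L_2^\alpha(X)\right]\in[\beta_0,0]$.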



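Proof: Let $Q_\beta$ denote the distribution with density

\[ \frac{dQ_\beta}{d\Gamma_{\alpha+1,1}} = \frac{\exp\left(\beta\cdot\tilde L_2^\alpha(x)\right)}{Z(\beta)}. \]

We have to prove that

\[ D(Q_\beta\|\Gamma_{\alpha+1,1}) \ge \tfrac12\left(\mu_\beta\right)^2 \]

for $\mu_\beta\in[\beta_0,0]$. We have
$D(Q_\beta\|\Gamma_{\alpha+1,1})=\beta\cdot\mu_\beta-\ln\left(Z(\beta)\right)$
and $\mu_\beta=\frac{Z'(\beta)}{Z(\beta)}$. The inequality is satisfied
for $\beta=0$, so we differentiate with respect to $\beta$ and have to
prove that

\[ \frac{d\mu_\beta}{d\beta}\beta + \mu_\beta - \frac{Z'(\beta)}{Z(\beta)}
   \le \frac12\cdot 2\mu_\beta\frac{d\mu_\beta}{d\beta}, \]

which, since $\frac{d\mu_\beta}{d\beta}>0$, is equivalent to

\[ \beta \le \mu_\beta. \]

Since we have assumed that $\mu_\beta\in[\beta_0,0]$ it is sufficient
to prove the inequality for $\beta\in[\beta_0,0]$. The inequality is
satisfied for $\beta=0$, so we differentiate once more so that we have
to prove the inequality

\[ 1 \ge \frac{d\mu_\beta}{d\beta}
   = \frac{Z''(\beta)Z(\beta)-\left(Z'(\beta)\right)^2}{\left(Z(\beta)\right)^2}. \]

Hence it is sufficient to prove that

\[ \frac{Z''(\beta)}{Z(\beta)} \le 1, \]

which is equivalent to

\[ Z''(\beta) \le Z(\beta). \]

Since $Z(\beta)\ge 1$ for all $\beta$, it is sufficient to prove that
$Z''(\beta)\le 1$ for $\beta\in[\beta_0,0]$. The function
$\beta\to Z''(\beta)$ is convex and

\[ Z''(0) = \int_0^\infty\left(\tilde L_2^\alpha(x)\right)^2\frac{x^\alpha}{\alpha!}\exp(-x)\,dx = 1. \]

Therefore it is sufficient to check that $Z''(\beta_0)\le 1$, which is
exactly what is stated in (3). ■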
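Two facts are used without proof in the last steps: $Z(\beta)\ge 1$ and the convexity of $\beta\mapsto Z''(\beta)$. A short justification, sketched here under the assumption that differentiation under the integral sign is permitted on the relevant interval, could read as follows, where $E$ denotes expectation with respect to $\Gamma_{\alpha+1,1}$.

```latex
\begin{align*}
Z(\beta) &= E\left[\exp\left(\beta\tilde L_2^{\alpha}(X)\right)\right]
  \ge \exp\left(\beta\,E\left[\tilde L_2^{\alpha}(X)\right]\right) = \exp(0) = 1
  && \text{(Jensen; } \tilde L_2^{\alpha} \text{ is orthogonal to constants)},\\
Z''(\beta) &= E\left[\tilde L_2^{\alpha}(X)^2\exp\left(\beta\tilde L_2^{\alpha}(X)\right)\right],\\
\frac{d^2}{d\beta^2}Z''(\beta) &= E\left[\tilde L_2^{\alpha}(X)^4\exp\left(\beta\tilde L_2^{\alpha}(X)\right)\right] \ge 0 .
\end{align*}
```

A convex function on an interval is bounded by its values at the endpoints, so $Z''(\beta)\le\max\left(Z''(\beta_0),Z''(0)\right)$ for $\beta\in[\beta_0,0]$, which is why only the endpoint $\beta_0$ has to be checked.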

A. Large values of the shape parameter 

If the scale parameter is fixed at 1 and the shape parameter 
tends to infinity then the Gamma distribution will tend to a 
Gaussian. We know that one can get a lower bound on the 
information divergence in terms of the Hermite polynomial of 
order 2 so we should expect this also to hold for large values 
of the shape parameter. This is indeed the case as stated in the
following theorem.

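Theorem 15. For any $\alpha\ge 6\tfrac12$

\[ D(X\|\Gamma_{\alpha+1,1}) \ge \tfrac12\left(E\left[\tilde L_2^\alpha(X)\right]\right)^2 \]

for any random variable $X$ satisfying
$E\left[\tilde L_2^\alpha(X)\right]\le 0$.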



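Proof: Let $\beta_0$ denote the negative solution to the equation
$\beta^2\exp\left(\beta^2\right)=1$. The value is approximately
$\beta_0=-0.75309$. The function $f(x)=x^2\exp(\beta_0 x)$ is
decreasing for $x\in\,]-\infty,0]$, increasing for
$x\in[0,-2/\beta_0]$ and decreasing for $x\in[-2/\beta_0,\infty[$. The
local maximum in $x=-2/\beta_0$ has the value $0.9545<1$. According to
the definition of $\beta_0$ we have $f(\beta_0)=1$, so $f(x)\le 1$ for
$x\ge\beta_0$. The second normalized Laguerre polynomial is

\[ \tilde L_2^\alpha(x)
   = \frac{\frac{x^2}{2}-(\alpha+2)x+\frac{(\alpha+2)(\alpha+1)}{2}}{\left(\frac{(2+\alpha)(1+\alpha)}{2}\right)^{1/2}}
   = \frac{x^2-2(\alpha+2)x+(\alpha+2)(\alpha+1)}{\left(2(2+\alpha)(1+\alpha)\right)^{1/2}}. \]

The minimum is attained for $x=\alpha+2$ and has the value

\[ -2^{-1/2}\left(1+\frac{1}{\alpha+1}\right)^{1/2}. \]

This is an increasing function of $\alpha$ that tends to
$-2^{-1/2}>\beta_0$ for $\alpha$ tending to $\infty$. We solve the
equation

\[ -2^{-1/2}\left(1+\frac{1}{\alpha+1}\right)^{1/2} = \beta_0 \]

and get

\[ \alpha_0 = \frac{1}{2\beta_0^2-1}-1 = 6.4466. \]

Therefore $\tilde L_2^\alpha(x)\ge\beta_0$ for all $x$ if
$\alpha\ge\alpha_0$. Hence $f\left(\tilde L_2^\alpha(x)\right)\le 1$
for all $x$ if $\alpha\ge\alpha_0$. In particular

\[ \int_0^\infty\left(\tilde L_2^\alpha(x)\right)^2\exp\left(\beta_0\tilde L_2^\alpha(x)\right)\frac{x^\alpha}{\alpha!}\exp(-x)\,dx
   = \int_0^\infty f\left(\tilde L_2^\alpha(x)\right)\frac{x^\alpha}{\alpha!}\exp(-x)\,dx \le 1. \]

This proves that the inequality holds whenever
$E\left[\tilde L_2^\alpha(X)\right]\in[\beta_0,0]$. The condition
$E\left[\tilde L_2^\alpha(X)\right]\ge\beta_0$ is automatically
fulfilled if $\alpha\ge\alpha_0$ and the theorem follows. ■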
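The numerical constants in this proof can be reproduced with a few lines of code. The sketch below is a hypothetical helper, not the authors' computation: it finds $\beta_0$ by bisection on an assumed bracket $[-1,-0.5]$, evaluates the local maximum of $f$, and computes the threshold $\alpha_0$ from the closed-form minimum of the normalized polynomial.

```python
import math

def solve_beta0(lo=-1.0, hi=-0.5, iters=200):
    # Bisection for the negative root of beta^2 * exp(beta^2) = 1.
    g = lambda b: b * b * math.exp(b * b) - 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # g is positive at lo = -1 and negative at hi = -0.5.
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

beta0 = solve_beta0()
f = lambda x: x * x * math.exp(beta0 * x)
local_max = f(-2.0 / beta0)                      # value at the local maximum
alpha0 = 1.0 / (2.0 * beta0 ** 2 - 1.0) - 1.0    # threshold for the shape parameter

def min_normalized_l2(alpha):
    # Minimum over x of the normalized Laguerre polynomial of degree 2.
    return -math.sqrt(0.5 * (1.0 + 1.0 / (alpha + 1.0)))

print(beta0, local_max, alpha0)
print(min_normalized_l2(6.5) >= beta0, min_normalized_l2(6.0) >= beta0)
```

The last line shows that the argument of the proof applies at $\alpha=6\tfrac12$ but not at $\alpha=6$, so smaller half integral values need a separate check.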

B. Chi square distributions 

The $\chi^2$-distributions are Gamma distributions with half integral
value of $\alpha$ and scale parameter 2. We will check our conjecture
for $\alpha\le 6\tfrac12$ and half integral values. For notational
convenience we will assume that the scale parameter is 1 and note that
results for $\chi^2$-distributions are obtained by a simple scaling.
According to Lemma 14 it is sufficient to calculate the integral (3)
when

\[ \beta_0 = \min_x\tilde L_2^\alpha(x) = -2^{-1/2}\left(1+\frac{1}{\alpha+1}\right)^{1/2}. \]

The results are given in the following table.

    alpha      beta_0      integral (3)
    -1/2       -1.225      0.95407
    0          -1          0.63113
    1/2        -0.9129     0.55406
    1          -0.8660     0.52046
    1 1/2      -0.8367     0.5018
    2          -0.8165     0.48997
    2 1/2      -0.8018     0.48181
    3          -0.7906     0.47584
    3 1/2      -0.7817     0.47128
    4          -0.7746     0.46769
    4 1/2      -0.7687     0.46478
    5          -0.7638     0.46238
    5 1/2      -0.7596     0.46037
    6          -0.7559     0.45865

Since the integral is less than 1 in every row, the conjecture holds
for these values of $\alpha$, and combined with Theorem 15 the
conjecture holds for all half integral values of $\alpha$. This gives
us the following theorem.
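The tabulated integrals can be approximated with elementary quadrature. The following sketch is one possible way to do this and is not claimed to be the method behind the table: it substitutes $x=t^2$ to remove the singularity of $x^\alpha$ at 0 for $\alpha=-1/2$, and applies a composite Simpson rule on a truncated range. The truncation point `t_max` and the number of steps are assumptions.

```python
import math

def normalized_l2(x, alpha):
    # Normalized Laguerre polynomial of degree 2 (positive leading coefficient).
    num = x * x - 2.0 * (alpha + 2.0) * x + (alpha + 2.0) * (alpha + 1.0)
    return num / math.sqrt(2.0 * (alpha + 2.0) * (alpha + 1.0))

def integral_3(alpha, t_max=12.0, steps=20000):
    # Integral (3) with beta_0 equal to the minimum of normalized_l2.
    # The substitution x = t^2 is valid here for alpha >= -1/2.
    beta0 = -math.sqrt(0.5 * (1.0 + 1.0 / (alpha + 1.0)))

    def g(t):
        ell = normalized_l2(t * t, alpha)
        weight = (2.0 * t ** (2.0 * alpha + 1.0) * math.exp(-t * t)
                  / math.gamma(alpha + 1.0))
        return ell * ell * math.exp(beta0 * ell) * weight

    h = t_max / steps  # composite Simpson rule; steps must be even
    total = g(0.0) + g(t_max)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * g(i * h)
    return total * h / 3.0

for alpha in (-0.5, 0.0, 0.5, 1.0, 6.0):
    print(alpha, round(integral_3(alpha), 5))
```

For $\alpha=0$ the integral can also be evaluated in closed form with Gaussian partial moments, which gives approximately 0.6311 and agrees with the table.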

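Theorem 16. Assume that $\alpha>-1$ and that $2\alpha$ is an integer.
Then, for any random variable satisfying
$E\left[\tilde L_2^\alpha(X)\right]\le 0$, we have

\[ D(X\|\Gamma_{\alpha+1,1}) \ge \tfrac12\left(E\left[\tilde L_2^\alpha(X)\right]\right)^2. \]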



Example 17. For $\alpha=0$ we get the exponential distribution with
density

\[ \exp(-x), \quad x > 0. \]

The Laguerre polynomial of order two is
$L_2^0(x)=\frac{x^2}{2}-2x+1$. We will rewrite our inequality in terms
of mean and variance. For any random variable satisfying
$\operatorname{Var}(X)\le 1$ and $E[X]=1$ we get the inequality

\[ D(X\|Exp(1)) \ge \tfrac18\left(\operatorname{Var}(X)-1\right)^2. \]
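The constant in this inequality can be traced with a short computation. It only uses that for $\alpha=0$ the normalizing constant $\binom{2}{2}^{1/2}$ equals 1, so that the normalized polynomial coincides with $L_2^0$, together with the assumption $E[X]=1$.

```latex
\begin{align*}
E\left[\tilde L_2^{0}(X)\right] &= \tfrac12 E\left[X^2\right] - 2E[X] + 1
  = \tfrac12\left(\operatorname{Var}(X)+1\right) - 1
  = \tfrac12\left(\operatorname{Var}(X)-1\right),\\
\tfrac12\left(E\left[\tilde L_2^{0}(X)\right]\right)^2
  &= \tfrac18\left(\operatorname{Var}(X)-1\right)^2 .
\end{align*}
```

The sign condition $E\left[\tilde L_2^{0}(X)\right]\le 0$ is then exactly the condition $\operatorname{Var}(X)\le 1$.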



The $\chi^2$-distribution with 1 degree of freedom corresponds to a
Gamma distribution with shape parameter $\alpha+1=1/2$ and scale
parameter 2. It has density

\[ \frac{x^{-1/2}}{\tau^{1/2}}\exp\left(-\frac{x}{2}\right), \quad x > 0. \]

This distribution is important because it is the distribution of the
square of a standard Gaussian random variable. Hence, results for the
$\chi^2$ distribution translate into results for Hermite moments. In
order to follow the notation from the previous section we first prove
results for the Gamma distribution with shape parameter
$\alpha+1=1/2$ and scale parameter 1 and then translate the results.

We have

\[ L_2^{-1/2}(x) = \frac{x^2}{2}-\frac{3x}{2}+\frac38 \]

and the normalized version

\[ \tilde L_2^{-1/2}(x) = \frac{\frac{x^2}{2}-\frac{3x}{2}+\frac38}{\left(3/8\right)^{1/2}}. \]

This gives us the following theorem. 

Theorem 18. For any random variable satisfying
$E\left[\tilde L_2^{-1/2}(X)\right]\le 0$, we have

\[ D(X\|\Gamma_{1/2,1}) \ge \tfrac12\left(E\left[\tilde L_2^{-1/2}(X)\right]\right)^2. \]



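Corollary 19. For a random variable $X$ satisfying $E[X]=1$ and
$\operatorname{Var}(X)\le 2$ we have

\[ D(X\|\chi^2_1) \ge \frac{\left(\operatorname{Var}(X)-2\right)^2}{48}. \]

Proof: The result follows from the following computation.

\begin{align*}
D(X\|\chi^2_1) &= D\left(X/2\,\|\,\Gamma_{1/2,1}\right)\\
 &\ge \tfrac12\left(E\left[\tilde L_2^{-1/2}(X/2)\right]\right)^2\\
 &= \tfrac{1}{48}\left(\operatorname{Var}(X)+E[X]^2-6E[X]+3\right)^2\\
 &= \frac{\left(\operatorname{Var}(X)-2\right)^2}{48}.
\end{align*}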



These inequalities can be translated into inequalities for 
Hermite polynomials. 
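One way to see the translation, sketched here under the extra assumptions that $Y$ has $E[Y]=0$ and $\operatorname{Var}(Y)=1$ (the symbol $Y$ is introduced only for this illustration), is to apply Corollary 19 to $X=Y^2$ and to use the data processing inequality for the map $y\mapsto y^2$, which sends a standard Gaussian to a $\chi^2_1$ variable.

```latex
\begin{align*}
E\left[Y^2\right] &= 1, \qquad
\operatorname{Var}\left(Y^2\right) - 2 = E\left[Y^4\right] - 3 = \kappa
  \quad\text{(excess kurtosis of } Y\text{)},\\
D(Y\|\Phi) &\ge D\left(Y^2\,\middle\|\,\chi^2_1\right)
  \ge \frac{\left(\operatorname{Var}\left(Y^2\right)-2\right)^2}{48}
  = \frac{\kappa^2}{48}
  \qquad\text{if } \kappa\le 0 .
\end{align*}
```

The condition $\operatorname{Var}(X)\le 2$ of Corollary 19 becomes $\kappa\le 0$, that is, $Y$ is platykurtic.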

Corollary 20. For any random variable satisfying
$E\left[H_4(X)\right]\le 0$ we have

\[ D(X\|\Phi) \ge \tfrac12\left(E\left[H_4(X)\right]\right)^2. \]

If $\operatorname{Var}(X)=1$ this is equivalent to

\[ D(X\|\Phi) \ge \frac{\kappa^2}{48} \tag{4} \]

if $X$ is platykurtic and $\kappa$ denotes the excess kurtosis.

The inequality was proved in [2] with a different technique.

V. Counterexample

With all these positive results in mind one may conjecture that

\[ D(X\|\Gamma_{\alpha+1,1}) \ge \tfrac12\left(E\left[\tilde L_k^\alpha(X)\right]\right)^2 \tag{5} \]

would hold for all $k$ as long as
$E\left[\tilde L_k^\alpha(X)\right]\le 0$, but this is not the case.
Here we will describe a counterexample for $k=3$ and $\alpha=-1/2$.
We will fix $E\left[\tilde L_3^{-1/2}(X)\right]=-3$. In this case the
information projection of $\Gamma_{1/2,1}$ onto the set of
distributions satisfying $E\left[\tilde L_3^{-1/2}(X)\right]=-3$ equals
the distribution $Q_\beta$ with density

\[ \frac{dQ_\beta}{d\Gamma_{1/2,1}}(x)
   = \frac{\exp\left(\beta\tilde L_3^{-1/2}(x)\right)}{\int_0^\infty\exp\left(\beta\tilde L_3^{-1/2}(x)\right)\Gamma_{1/2,1}(x)\,dx}. \]

Numerical calculations give $\beta=-1.83125$ and
$D(Q_\beta\|\Gamma_{1/2,1})=3.3195$, which is not greater than
$\tfrac12(-3)^2=4.5$. The counterexample implies that there exists a
random variable $X$ such that

\[ D(X\|\Phi) < \tfrac12\left(E\left[H_6(X)\right]\right)^2. \]



